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A HIERARCHICAL BAYESIAN APPROACH TO RECORD 
LINKAGE AND POPULATION SIZE PROBLEMS 1 

By Andrea Tancredi and Brunero Liseo 
Sapienza Universita di Roma 

We propose and illustrate a hierarchical Bayesian approach for 
matching statistical records observed on different occasions. We show 
how this model can be profitably adopted both in record linkage prob- 
lems and in capture-recapture setups, where the size of a finite pop- 
ulation is the real object of interest. There are at least two important 
differences between the proposed model-based approach and the cur- 
rent practice in record linkage. First, the statistical model is built up 
on the actually observed categorical variables and no reduction (to 
0-1 comparisons) of the available information takes place. Second, 
the hierarchical structure of the model allows a two-way propaga- 
tion of the uncertainty between the parameter estimation step and 
the matching procedure so that no plug-in estimates are used and 
the correct uncertainty is accounted for both in estimating the pop- 
ulation size and in performing the record linkage. We illustrate and 
motivate our proposal through a real data example and simulations. 

1. Introduction. The current explosion in the availability of data from 
multiple sources, and the relative ease of information storage have led to 
a great popularity of statistical methods which aim at merging and/or 
matching statistical information available from different sources. Among 
these methods, record linkage refers to the problem of identifying statis- 
tical units which may be present in more than one data set. Fienberg and 
Manrique-Vallier (2009) review the relevance of record linkage procedures in 
official statistics and highlight the significant intertwins with missing data 
and multiple systems estimation literature. 

The gist of this paper is the proposal of a hierarchical Bayesian frame- 
work which can be profitably adopted both in record linkage problems and 
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in capture-recapture scenarios, where the size of a finite population is the 
main object of interest and the number of "re-captured" individuals is un- 
known. Most of the current approaches to population size estimation with 
matching uncertainty consider the matching and the size estimation as two 
logically well separated steps. Remarkable recent exceptions are Link et al. 
(2009) and Wright et al. (2009) where genotype misidentification is embed- 
ded into multiple mark-recapture models for estimating animal abundance 
using DNA samples. More generally, in this paper, we propose a unified 
framework where matching uncertainty is naturally accounted for in estimat- 
ing population size by using samples of multivariate categorical variables. 

To motivate our approach, consider the following example, which is a part 
of a real application. Suppose we have two data sets which we call A and B 
with sizes, respectively, 34 and 45. Data set A comprises all the foreign resi- 
dents observed in a small census block during the 2001 Italian census popula- 
tion survey (CPS). Data set B comprises all the resident foreigners observed 
in the same census block during the post enumeration survey (PES) 2 . Both 
data sets report, among others, the following variables: (1) first two conso- 
nants of the family name, (2) gender and (3) education level. Assume that 
the three variables represent the only available information to perform the 
match; assume also that the goal is the estimation of N , the total number 
of foreign residents in the census block. The usual approach to this problem 
would be to search for the pairs of units, belonging to different files, which 
agree perfectly on each observed variable. In our example there are 25 pairs 
which show a complete agreement. If we assume that we actually observed 
25 recaptures, such information can be used easily in a capture-recapture 
model to make inference on N. However, two complications may arise. First, 
it might be possible that two different units genuinely agree on each variable. 
Second, because of measurement error, observed records for the same unit 
might be different in the two sampling occasions. They could also agree as 
before, even if they refer to different units with different true values. We will 
discuss this example below in more detail. For the moment, Table 1 summa- 
rizes, for the different choices of the declared number T of recaptures, the 
posterior distribution of N assuming a noninformative prior p(N) oc 1/N 2 

and a hypergeometric likelihood function p(T\N) ex. ( n T ) ( n B™ T ) / („s) with 
n A = 34 and n B = 45. One can see that slightly different choices of T may 
produce dramatically different posterior distributions. 

Accounting for matching uncertainty has relevance well beyond size esti- 
mation problems and relates to the more general problem of inference with 
integrated data; see Judson (2007). In this context, an important exem- 
plification is provided by Lahiri and Larsen (2005) who take into account 



2 PES is usually performed some time after CPS, to evaluate the effective coverage of 
CPS. 
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Table 1 

Posterior quantiles for N with the distribution 
p(N\T) oc ( n T A ) (^b"t)/C^) x (V^ 2 ) with n A = 34, n B = 45 and different 

choices of T 
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linkage uncertainty in the framework of the linear regression model when 
the response variable and the covariates are recorded on two different oc- 
casions. However, our approach can also be applied when there is not yet 
a scheduled statistical analysis to be performed on the linked data, but the 
linkage procedure is just the initial step to obtain a larger and integrated 
reference data set. 

Statistical methods for finding entries related to the same entity in two 
or more files are employed in many different disciplines, such as medicine, 
business administration and official statistics [see, e.g., Herzog, Scheuren 
and Winkler (2007)]. In these contexts it may happen that a unique data 
set with all the necessary information for a particular statistical analysis is 
not available. Furthermore, time and cost constraints may make it unfeasible 
to create such a data set anew. Integration at the unit level of different data 
sets (sample surveys and/or administrative data sets) may be an answer to 
this kind of problem. A considerable difficulty in this context is represented 
by the lack of a unique identifier in the different data sets for each unit of 
interest. In fact, when a set of observed variables (key variables, henceforth) 
may be used as an identifier for connecting records that refer to the same 
unit, particular attention should be paid to errors, as we have seen in the 
introductory example, and missing values. 

To handle the record linkage process, many different methodologies have 
been introduced. Some methods are naive, or heuristic, that is, are based 
only on common sense [e.g., the "iterative method" described in Armstrong 
and Mayda (1993)]. In a fundamental paper, Fellegi and Sunter (1969) put 
these kinds of problems into a firm, model based, statistical framework. Fur- 
ther advances were described in a number of papers in the 1980s and 1990s: 
among others, Jaro (1989), Winkler (1993) and Belin and Rubin (1995). 
Larsen and Rubin (2001) introduce the representation of the record linkage 
problem in terms of the mixture model [see also Larsen (1999)]: this idea has 
been exploited in many other papers; see, for example, Fortini et al. (2001), 
McGlincy (2004) and Larsen (2004) who tackle the problem from a Bayesian 
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perspective. All of these papers assume that each single comparison between 
records in two different files provides new information, independently of the 
other comparisons. This assumption, as noted by Kelley (1986), is funda- 
mentally unsound, as illustrated in Section 2. Also, in this respect, Winkler 
(2000) states that ". . . because the underlying true probabilities have not been 
accurately estimated, estimated error rates ( of the record linkage procedure ) 
are not accurate. " 

An important feature of our paper is that we propose a Bayesian model 
which is based on the actual observed data rather than comparisons. In 
a similar spirit, Fortini et al. (2002) discussed these ideas in the simple 
setting of a single continuous variable. Here we will assume that our key 
variables will be discrete, as almost always happens in practice. 

Record linkage is not the only statistical problem where matching issues 
are concerned. In a bioinformatics context Green and Mardia (2006) intro- 
duce a matching matrix (very similar to our matrix C, see later) into some 
problems of shape analysis, where configurations of points in space need to 
be matched and the points are not completely labeled. 

DeGroot and Goel (1980) consider the situation where a random sample 
of size n, say, (Xj,Zj), i = l,...,n, is drawn from a bivariate normal dis- 
tribution; however, before the sample values are recorded, each observation 
[xi\Zi) gets broken into two separate components. As a consequence, the 
available information is represented by the vectors x = (xi, . . . , x n ) and y = 
(yi , . . . , y n ) , where y is an unknown permutation of the values {z\ , . . . , z n ). 

Another matching example is discussed in Lindley (1977), in a forensic 
framework. Here the matching problem arises when some material is found 
at the scene of a crime and similar material is found on a suspect; in both 
cases material collection is subject to measurement error. Lindley describes 
a Bayesian method to establish whether the two materials come from the 
same source or not. When rephrasing Lindley 's approach from a record link- 
age perspective, we note that that paper was the first attempt to introduce, 
into a Bayesian linking model, the natural idea that two units with the same 
surname are more likely to be a match if the surname is Bodolomonogoto 
than if the surname is Smith. Similar suggestions can be found in the seminal 
papers by Newcombe et al. (1959) and Fellegi and Sunter (1969). 

The paper is structured as follows. In Section 2 we present the standard 
approach to record linkage. Our Bayesian approach is discussed in Section 3. 
Markov chain Monte Carlo (MCMC) methods are needed for estimating 
the parameters of the model. In Section 4 we describe a suitable algorithm 
for simulating the posterior distribution. We also discuss a loss function 
approach to the matching estimation. In Section 5 the performance of the 
methodology is evaluated through a small illustrative application. A more 
realistic example is shown in Section 6. A simulation study is conducted in 
Section 7. Finally, in Section 8 we give a brief discussion of possible future 
extensions and improvements of the method. 
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2. Classic approach to record linkage. Suppose we are given two record 
configurations x A and x B of different sizes n A and n B with 

x A = (x A , . . . , x A , . . . , x A A )' and x B = (xf , . . . ,x B , . . . ,x B B )' . 

Here x A = (x Al , . . . , x Ah ) and x B = (x Bl , . . . , x Bh ) are the observed values 
of a categorical random vector x = (x 1 , . . . ,x h ) whose support is the set 

v = { v hh,-,h = ( v ji> v h>- ■ • ' w iJ' 3i = l, • • • ,h; ■ ■ ■ ;jh = l, • • • , M- 

In the following, the two data configurations will be called, respectively, 
sample A and sample B, the components of the random vectors x (whenever 
it is possible we will avoid subscript and superscript indices to simplify the 
notation) are the key variables and the elements of the set V arranged in 
lexicographic order will be indicated with Vj for j = 1, . . . , k = ki -k^-'-kh- 
Let A x B be the set of all possible pairs of units belonging to different 
samples. Set A x B = M U U, where M = {(a, b) £ A x B :a = b} (here a = b 
means that unit a of sample A and unit b of sample B are the same pop- 
ulation unit) and U = {(a, b) G A x B:a^b}. Probabilistic record linkage, 
as implemented, for example, in Jaro (1989), is performed by modeling the 
comparison vectors y a ^ = (yl a &, . . . , y^ b ) where 

Vectors y a b, a = 1, . . . ,n A , 6=1,... ,n B , are assumed independent condi- 
tionally on M and U. The probability distribution of y a b depends on the 
match or nonmatch status of the single pair (a, 6); in particular, it is as- 
sumed that p(y a b\(a,b) e M) = l\i=imf ab (l - m^-yab and p{y ab \{a,b) G U) 

= JliLi v% ab (1 — Ui) 1 ^^ (here and later, we will abuse notation by letting the 
arguments define the functions) with m = (mi, . . . , m^) and u = (u±, . . . , u/j) 
as unknown probabilities vectors. In addition, the elements of the sets M 
and U are modeled assuming that each pair in A x B is a match with prob- 
ability w, independently of all the other pairs. This way the comparison 
vectors y a b are independent and identically distributed as a mixture of two 
multivariate Bernoulli distributions: 

h 

p(yab\m,u,w) = w JJm^ a6 (l — mj) 1_y "i' 
i=i 

(2.1) 

h 

+ {i-w)l[uf« b (i-u i ) 1 -y«K 
i=i 

Models similar to (2.1) are often used also in biostatistics, under the name 
of a latent class model, to assess diagnostic test accuracy in the absence of 
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a gold standard and only multiple imperfect tests are available [Pepe (2003)] . 
Likelihood maximization of the parameters in model (2.1) is performed via 
the EM algorithm and analytical expressions for the estimators are provided 
by Fellegi and Sunter (1969) and Pepe and Janes (2007) in the case h = 3. 

Several extensions of this basic setup have been proposed; see, for exam- 
ple, Larsen and Rubin (2001). In order to decide whether to declare a link 
a single pair, one can consider the likelihood ratio 

(22) A P(yab\{a,b) G M) nti^fr(i - m) 1 -^ 

P(yab\(a,b)£U) nil -^i) 1 ""- 1 
or the posterior probability 
p{(a,b) £ M\y ab ) 

(2.3) 

wUti m f b ( l - mi) 1 -^ + {l-w) nil uf h {\ - Ul ) l -y« b 

Pairs with high values of A or p((a,b) G M\y ab ) are then declared matches. 
This approach is formalized in the classical approach of Fellegi and Sunter 
(1969). 

In our opinion the above approach can be criticized on several grounds: 

1. Decision rules for classifying records as matches. In general, all the pairs 
with a likelihood ratio A, or a posterior probability, above a fixed thresh- 
old are declared matches. However, the choice of the threshold can be 
problematic, as illustrated, for example, in Belin and Rubin (1995). More 
details about this point will be given in Section 4. 

2. Avoiding multiple matches. Current approaches to record linkage assume 
that there are no duplications in the same file and inference procedures 
should account for that. However, in classical procedures, it might hap- 
pen that a single record in A is linked to more than one record in B; 
consequently, some extra assumptions are necessary. Jaro (1989) pro- 
poses a linear programming approach after a preliminary match estima- 
tion step. An alternative approach [Fortini et al. (2001)], which will be 
pursued here, incorporates the constraints into the sampling model. 

3. Incorporating sampling information. If we assume that the two files are 
random samples without replacement from a population of unknown 
size N, an obvious prior assumption is p((a,b) G M) = 1/N, with N > 
max {n^n^}. In addition, if we know that two units assume the same 
value Vj , the matching probability becomes p((a, b) G M) = 1/Fj, where Fj 
is the (unknown) total number of units with record Vj in the population. 
In record linkage procedures, in general, sources of knowledge of this 
type are not included in the model, with an obvious loss of information. 



BAYESIAN MATCHING AND POPULATION SIZE 



7 



This may be particularly important for applications of record linkage in 
disclosure literature. 

4. Comparison vectors are not independent. In this respect Kelley (1986) 
states: ". . . The decision procedure . . . was developed under the hypothesis 
that the comparison vectors between separate record pairs are independent. 
However, since the record pairs that are considered for possible matches 
are elements of the cross product of the two files we are attempting to 
match, the comparison vectors are in fact dependent . ..." As a matter 
of fact, the random variables y a b are deterministically dependent. To see 
that, consider the case of one key variable X\. Suppose that x^ 1 = xf 1 
and x^ 1 = x Bl • If, in addition, x^ 1 = x B 1 , it must necessarily be true 
that x^ 1 = x Bl , that is, in terms of comparisons, p(y22 = = l 5 yi2 = 
1)2/21 = 1) = 1. Moreover, the problem of dependency among the j/ a b's 
cannot be circumvented by eliminating redundant comparisons in the 
likelihood function, because the order in which pairs are considered would 
matter! 

5. The components y l ab of the comparison vector may not be independent 
conditionally on M and U. The conditional independence assumption 
among the key variables often fails in practice: disagreement on different 
key variables for a true match might be caused by a unique reason which 
introduces correlation among the y„ fe 's. In the absence of conditional in- 
dependence, the resulting estimates of w, m and u lose their meaning and 
a more sophisticated conditional dependence structure must be specified. 
Similar arguments have been applied to criticize the use of model (2.1) for 
the analysis of diagnostic test performance without a gold standard and, 
in this context, several solutions have been proposed and discussed [Al- 
bert and Dood (2004); Pepe and Janes (2007)]. Larsen and Rubin (2001) 
have introduced interactions among key variables; see also Winkler (1995) 
and references therein. 

3. The new model. We assume that the records in x A and x B are mea- 
surements subject to recording error of a multivariate categorical variable 
/i = (/il, . . . , fi h ) whose support is, on both occasions, the set V. Specifically, 
let 

/i A = (/^,...,^,...,4*0' and fi B = (fi B ,..., t i B ,..., fJ , B B )' 

be two independent random samples from the multivariate categorical vari- 
able \i drawn on different occasions from the same finite population. Let 
Mo = (^'.•••.^).« = l,...,n A and /if = (/if \ . . . ,/if h ), b = 1, . . . ,n B be 
the unobserved true values for unit a in sample A and unit b in sample B. 
We assume that, conditionally on their respective true values and a pa- 
rameter vector ;S = (f3\, . . . ,/3^) which accounts for the measurement error, 
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x A and x B are independent, that is, 

p(x A , x B \p A ,p B ,f3) = p(x A \p A , P)p{x B \p B , (3); 

we also assume that, in each sample, all the observations are conditionally 
independent given their true values and ft Then 

p(x A \fi A ,P) = l[p(xt\f4 t l3), p(x B \^ B ,P) = ]J P (x B \n B ,P), 

a=l 6=1 

with 

h h 

p(x A \fj, A , p) = ~[[p(x A > iMaSA), p(xf\fx B ,/3) = l[p(x Bi \fx Bi , ft). 
1=1 1=1 

Note that the vectors \i A and pP introduce a first latent structure into our 
record linkage model and make it effectively a missing data model [Fienberg 
and Manrique-Vallier (2009)]. 

We conclude the top stage of the hierarchical structure by explicitly in- 
troducing the measurement error model. A general model for potentially 
misclassified observed records can be formulated as p(x l = v)]^ = v 1 ,), for 

all Such a model has been considered, in a Bayesian framework, 

by Swartz et al. (2004) who discuss several identifiability problems, and 
by Perez et al. (2007), where strong prior information is introduced in the 
model. Here, to maintain the number of parameters in the model reasonably 
low, we propose a simpler version of the so-called hit-miss model [Copas and 
Hilton (1990)] 

p^ = v). |// = v),) = Pil{v). = v),) + (1 - Pi)/k i} i = l,...,h, 

where ft represents the probability of observing the true value for the ith 
variable "not by chance" and ki is the number of levels of variable x % . This 
way, conditionally on the unobserved true values, each single record field 
can be modeled as a mixture of two components: the first component is 
concentrated on the true value, while the second one is uniformly distributed 
over the set v % = {v\, . . . ,v l k .}. For a recent implementation of the hit-miss 
model see also Noren, Orre and Bate (2005). 

We now specify the conditional distributions of p, and fi B . In particular, 
we assume that p A and \i B are two independent simple random samples 
drawn without replacement from a finite population of unknown size N. 
The unknown vector F = (Fx, . . . ,Fj, . . . ,Ff.), k = Hi=i represents the 
population counts for each element Vj of the set V. Obviously, Ylj=x Fj = N. 
In principle, one can write the model for the unobserved true values p A 
and \x B in the following natural way: 

(3.1) p(p A ,p B \F)=p(p A \F)p(p B \F) 
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with 



(3.2) p(p s \F) 



n 



f S fS 



-1 



N 



-i k 

n 

3 



1 ^3 



S = A,B, 



where f s = (ff, ...,fj,..., /£), S = A, B, are the true sample counts (which 
are, however, unobservable, due to measurement error) for each element 
Vj G V. Formula (3.2) can be obtained by noticing that the observed values 
of p s determine the frequencies f s , so that p(p s \F) = p(p s \f s , F)p(f s \F) 
where p(p s \f s ,F) and p(f s \F) correspond to the two terms in (3.2). The 
usual constraints < f? < Fj, S = A, B, must hold. 

An alternative way of writing the above model is based on the use of two 
latent quantities, which will play a crucial role in our approach. The first 
quantity is the so-called matching matrix C. This is a n A x n B matrix whose 
generic element C ao is a Bernoulli random variable indicating whether or not 
unit a in sample A and unit b in sample B are the same unit, that is, 



Cab 



1, if (a, 6) GM, 
0, if (a, &)€£/. 

The matrix C is the actual quantity of interest in record linkage problems; 
a similar structure also appears in different statistical problems, such as the 
Bayesian alignment [Green and Mardia (2006)] or microarrays analysis [Do, 
Mueller and Tang (2005)]. We assume that multiple matches are not possi- 



ble. This implies that £ a C ab < 1 V6 = 1, . . . , n B , £ 6 C ab < 1 Va = 1 

^ x ,„B 
T 



,n 



A. 



also, note that there are ) (*^ )T! different C matrices with exactly T 
Sa6 C a b matches, T < min(n y 



The other latent quantity we introduce is the vector t = (t\, . . . ,tj, . . . ,tk) 
denoting, for each element of V, the number of matches having Vj as the 
true value. The vector t (which is basically needed to facilitate the simu- 
lation of the posterior distribution, as outlined in the following section) is 
a deterministic function of fi A ,p, B and C. 

Consider, as an illustration, the case where \x is univariate and V = {v%, v%, 
V3,v 4 }: suppose we have fi A = (vi,v 2 ,vi), pP = (v2,V3,vi,v 2 ), with C\ 3 = 
C24 = 1 and all the other elements of C equal to 0; then t = (1,1,0,0). 
Finally, notice that < t j < min{ff, ff] Vj = l,...,k and £j =1 t s = T. 
Now we introduce the model assumptions for the conditional distribution 



of p and p given the values of t,C and F. First, note that p(p , p \t, F, 
C) = when p A ^ p B and C a b = 1- Also, we have p(p A , H B \t, F, C) = either 
when min{f A , f B } < tj or max{f A , f B } > Fj. In any other situation it turns 
out that 



p(p A ,p B \C,t,F) 



n 



1 3 "3 



ff+t 



) 



-T,n L 



N-T 
-T,N-i 



10 



A. TANCREDI AND B. LISEO 



(3.3) 



nJ=iW/ 



T\(n J 



T)\{n B -T)\ ' 

The distribution in (3.3) has the following interpretation: the first term is 
the joint distribution of the sample counts f A and f B , say, p(f A , f B \C, t, F); 
it can be obtained by observing that, given the vector t, there are already tj 
elements in the category Vj, j = 1, . . . , k. Then, out of the total number of 
partitions of the N — T elements actually sampled in three disjoint sets 3 
of sizes n A — T, n B — T and N — n A — n B + T, one should only consider 



those where category Vj respectively appears / ■ — tj , /, 



B 



tj and Fj 



f B + tj times in the three sets, for j = 1, . . . , k. The other term in (3.3) is 
the conditional distribution p(fi A , /j, B \f A , f B ,C,t, F); given f A and f B , the 
matching matrix C and the vector t, there are 

T\{n A -T)\{n B -T)\ 

possible permutations of the elements of the two samples: among them, 
there are Yljtjl(f A — tj)\{f B —tj)\ permutations which exactly reproduce 

the orderings given in [i A and \i B . 

The prior distribution for C and t should reflect the random selection 
mechanism of the two samples. Conditionally on t and F, C has a uniform 
distribution on the set of all possible matching matrices with T matches. 
Loosely speaking, in the absence of information about fi A and fx B , all the 
possible couples are equally likely to be a match. Then, we have p(C,t\F) = 
p(C\t,F)p(t\F) with 



p(C\t,F)=p(C\t)={ 



0, 



n 
T 



n B 
T 



ab j 

otherwise. 



To derive the distribution p(t\F), one can observe that, given T, t is a vector 
of counts of the Vj categories in a simple random sample of size T drawn from 
the population. Then p(t\T, F) is a multivariate hypergeometric distribution. 
Finally, T, the total number of common units across the two samples, is 
a scalar hypergeometric random variable. Then, 



(3.4) 



p(t\F)=p(t\T,F)p(T\F) 



n 

3=1 



F; 



n A 
T 



N - 



n 



n 



N 
n B 



3 They respectively represent the "nonmatch" for samples A and B and the "nonsam- 
pled" units. 
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It is easy to see that, by averaging out over C and t in the distribution 
p(n , fi B ,C,t\F), one re-obtains the model expressed by (3.1) and (3.2). 
Details are given in Appendix A. For the moment notice that the use of the 
hypergeometric distribution p(T\F) in (3.4) is standard practice in capture- 
recapture modeling when the number T of common units across two samples 
is known [Darroch (1958), Seber (1986) and Marin and Robert (2007)]. 

At the bottom of the hierarchical model, one needs to specify the prior for 
the vector F; this is equivalent to assuming that the finite population which 
the two samples are drawn from is itself a random sample from a superpop- 
ulation model [Ericson (1969)]. In particular, following Hoadley (1969), we 
assume that, conditionally on N and a vector 9 = (9±, . . . , 9}~), with < 9{ < 1 
and Yli=i @i = 1' is a multinomial random variable, 

p(F 1 ,...,F k \9,N)= • IK'" 

i=i 

Regarding the prior for N, we suggest the following family of noninformative 
priors: 

p g (N)cxT(N-g+l)/N\, g>0; 

the hyperparameter g regulates the shape of the prior: the larger the value 
of g, the lower the prior weight on the right tail, which is integrable for all 
g > 1. The same prior model for F can be expressed by assuming that, for 
a fixed hyperparameter A > and 9, the population counts Fx,...,Fk are 
independent Poisson variables with rates \9\, . . . , A#/% and p g (X) oc 1/A 9 . 

Last, we assume that the prior for 9 is obtained first by modeling its el- 
ements via the product of marginal and conditional probabilities based on 
a specific association pattern for the key variables and then by putting inde- 
pendent Dirichlet distributions to each probability vector characterizing the 
resulting model for 9. A special case of this product of Dirichlet distributions 
is the hyper-Dirichlet prior which is used in the similar context of disclosure 
risk assessment by Forster and Webb (2007); see also O'Hagan and Forster 
(2004). Moreover, the "measurement error" parameters /3 are independent 
and uniformly distributed random variables; they are also independent of 
all the other model parameters. To sum up, the joint distribution of all the 
variables is expressed by the following factorization: 

p(x A , x B ,fi A , ^ B ,f3, C, t, F, N, 9) = p(x A , x B \n A , fi B , p) 

xp(n A ,fi B \F,C,t)p(C\t)p(t\F) 

xp(F\9,N)p(N)p(9)p(p), 



and a representation in terms of a directed acyclic graph is displayed in 
Figure 1. 
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Fig. 1. DAG representation of the joint probability model described in Section 3. 

4. Bayesian implementation. In this section we discuss a Metropolis within 
Gibbs algorithm for simulating from the joint posterior distribution p(ft A , ji B ', 
(3,t,F,N,9 | x A ,x B ); see Robert and Casella (2004) for a general overview 
about the MCMC theory and implementation. In our case the Gibbs algo- 
rithm structure is based on the following updating steps: 

V A ,v B ,t\F,N,d,p, 

F,N\v A ,v B ,t,9,P, 
9\fi A ,ft B ,t,F,N,(3, 

(3\n A ,H B ,t,F,N,6. 

When the matching matrix C is itself one of the parameters of interest, 
one can simply add, at each iteration of the algorithm, a draw from the 
conditional distribution 

P (C\fi A f, B ,f3,t,F,N,e,x A ,x B ). 

Details about this conditional distribution are given later in this sec- 
tion. 

To illustrate the first updating step, notice that 

p(p A , /j, b , t\F, 9, f3, x A , x B ) = p(n A \F, 0, x A )p(n B \F, P, x B ) 

xp(t\ f i A , f x B ,F,d,p,x A ,x B ). 
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Moreover, by using results from Appendix B, 



P (t\fi A ,fx B ,F,9,P,x 



(4.1) 



A x B ) 



: P (t\v A ^ B ,F) 



n 



m (51) 



\fB) 
J 3 



Thus, conditionally on all the other quantities, £i, . . . , t& are independent hy- 
pergeometric random variables. Then one should separately draw \i A and \i B 
from p([i A \F, (3, x A ) and p(fi B \F, (3,x B ) and t from (4.1). However, the direct 
simulation of [i A and [i B is not straightforward. To see why, let ^i^,...,^ be 
the population count for the category assumed by /i after eliminating, from 
the population, / units with categories //i, . . . ,/z/. Then 



p(p b \F,f3,x b ) (xp(fi b \F)p(x b \fi b ,/3) 

n s k 



X 



IP, 



=i 



=i 



{/if l =a;f 1 } 



+ (!-/% 



for 5 = ^4,5 and the direct simulation from the above distributions can be 
computationally hard. To circumvent the difficulty of directly simulating the 
entire joint distribution p(/j, s \F, (3, x s ), note that we can easily draw the full 
conditionals 



p(^\fi S _ s ,(3,x s )ocF u s 



(4.2) 



k 



n 

i=l 



for s = 1, . . . , n s and S = A, B. By simulating fi A and fi B from (4.2) follow- 
ing a Gibbs type updating and t by its true conditional distribution, we do 
not produce an exact draw from the conditional distribution of (fi A , [i B ,t). 
However, the latter is exactly the stationary distribution associated with 
the proposed step. This strategy can then be justified as an example of 
"Metropolis within Gibbs." Moreover, note that, in order to improve the 
mixing of the chain, for each iteration of the algorithm we can repeat more 
simulation cycles from the conditional distributions (4.2) in order to approx- 
imately generate, at each iteration, a random draw from the true conditional 
of (v A ,V B ,t). 

A standard Gibbs updating is possible for the second step. Consider the 
full conditional distribution of the vector F; using the results in Appendix A 
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and after some algebra, 



p(F\fj, A , v B ,t,6,/3,x A ,x B )K fJ- B \F, t)p(t\F)p(F\9) 




T(N-g + l) 



Then, random draws from the above distribution can easily be obtained by 
first simulating N from 



Subsequently, conditionally on N, one can draw v±, . . . ,Vk from a multino- 
mial distribution with parameters 9\ , . . . , 9% and size N -n A -n B + T, and 
then set Fj = Vj + f A + ff - tj . 

Incidentally, we notice that the posterior distribution (4.3) plays a cru- 
cial role also when the sample sizes n A and n B are assumed to be random 
and T is known. In fact, in this case, the vector [T,n A — T,n B — T,N — 
n A — n B + T] follows a multinomial distribution with parameters N and 
(p A p B ,p A (l — p B ),p B (l — p A ), (1 — p )(1 — p B )), where p A and p B represent 
the unknown capture probabilities in the two sampling occasions; see, for 
example, Bishop, Fienberg and Holland (1975). It follows that, for (p A ,p B ) 
unknown, inference for N can be drawn either by using the complete model 
[i.e., by introducing a prior for (p A ,p B ) and then getting the marginal pos- 
terior distribution p(N\n A , n B , T)] or, in a slightly approximate way, by 
eliminating (p A ,p B ) via a conditional argument [i.e., by using the condi- 
tional likelihood p(T\n A , n B , N)]. These two approaches typically produce 
very similar conclusions. In the former case, when assuming a uniform prior 
for (p A ,p B ), the marginal posterior of N is given by the expression for 
p(N\T) in (4.3) multiplied by (N + 1)~ 2 . In the latter case, inference is 
based only on (4.3). The complete multinomial likelihood can obviously be 
used within our approach by simply adding other Gibbs steps for (p A ,p B )- 
However, as in the case with known T, we do not expect to see substantial 
differences, and in the rest of the paper we will consider n A and n B as fixed. 

The updating of 6 can be done in a standard way since 




p(% A , /A t, F, N, 0, x A , x B ) oc p(0)p(F\9) 
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and the independent Dirichlet distributions characterizing p{9) are conjugate 
to p(F\6); see O'Hagan and Forster (2004). Finally, note that the conditional 
posterior density for ft is proportional to 

(ft + (1 - {hVkif^O- - p iT A + n*-n?^ 

where n AB is the total number of sample units where the observed value and 
the true value coincide for the zth key variable. One can easily see that the 
posterior distribution of rji = ft + (1 — conditionally on all the other 

variables, is Beta(n^ B + 1, n A + n B — h AB + 1) truncated on the set (k^~ ,1). 
Then we draw r/j from its Beta distribution and set ft = (A^j — l)/(ki — 1), 
for i = 1, . . . , k. 

4.1. Matching matrix simulation. In order to specify the conditional dis- 
tribution of C given all other quantities involved in the model, we introduce 
the sets Aj = {a: /i A = Vj} and Bj = {b : fi B = Vj}. In words, Aj is the set 
of units in sample A whose true value belongs to category Vj) these sets 
depend on \x A and pP . Let Cj be the block of the matrix C corresponding 
to the rows in Aj and the columns in Bj. Conditional on the true values, 
fi A and fi B , C a b = for each couple such that fi A ^ fi B ; then, outside the 
blocks C\, . . . , Ck, the elements of C will be equal to 0. Thus, 

k 

p(C\^ A , f i B ,t,F,e,x A ,x B ) = l[ P (C j \t ] J A J B ), 

i=i 

where p(Cj\tj, f A , f B ) is the discrete uniform distribution over the set of all 
possible configurations for the block Cj with exactly tj matches, 



p{Cj\tj,f?jf) 



fB 



Note that, by conditioning on the drawn values of the key variables, we au- 
tomatically create a blocking method able to limit the number of candidate 
matches. Blocking strategies are very popular in record linkage literature. 
They basically consist of a partition into homogeneous groups of all the 
possible comparisons among records in order to reduce the computational 
burden; see, for example, Newcombe (1967) or Winkler (2004). Within our 
approach the homogenous groups of records are identified at each step of 
the algorithm by the block matrices Cj's. 

4.2. Matching matrix estimation via MCMC algorithm. Now we describe 
inferential strategies for producing a "point estimate" in a record linkage 
analysis. The usual output of an MCMC based analysis is a sample of ap- 
proximately independent "observations," simulated from the posterior distri- 
bution. This sample can be used to obtain a representation of the uncertainty 
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about the parameters of interest, mainly the matrix C or N. In addition, 
record linkage procedures are often the first stage of a more complex statisti- 
cal analysis: they represent the crucial step of creating a suitable data set to 
be used afterward. In terms of statistical theory, this is equivalent to produc- 
ing a point estimate of C, from which we select the "declared" matches. Clas- 
sical inference methods usually provide plug-in estimates, based on theories 
developed in Fellegi and Sunter (1969) and Jaro (1989). First, the previously 
defined parameters m and u are estimated and then a sequence of statistical 
tests is performed in order to decide whether each pair (a, b) € A x B can 
be declared a match or not. The power of multiple tests is calibrated in 
order to obtain a specific level of the False Match Rate (FMR), that is, the 
ratio between the number of false matches and the total number of declared 
matches. Note that the FMR is exactly equivalent to the well-known False 
Discovery Rate [Benjamini and Hochberg (1995)], very popular in multiple 
comparison applications (wavelets theory, microarray analysis, etc.) Further- 
more, currently used record linkage procedures must complete the statistical 
data analysis with a reallocation procedure which eliminates inconsistencies 
among the results of different tests [see Jaro (1989) and the problem posed 
by Larsen (1999), paragraph 3.3]. 

The Bayesian way of facing a record linkage problem is different in spirit, 
and suggests interesting issues, both from a practical and a methodologi- 
cal perspective. Although in a formal Bayesian analysis one should select 
the point estimate as the one minimizing the posterior expected loss, it 
is common practice, in applications, to use the posterior mean or, some- 
times, the posterior median. Of course, these solutions do not appear rea- 
sonable in a record linkage context: the marginal posterior mean of each 
single element of the matrix C will be a number between and 1, which 
does not help much in deciding whether the pair (a, b) is a match or not. 
The use of the posterior median is even more complicated in multivariate 
discrete settings. Thus, a formal decision theoretic approach seems neces- 
sary: let G = {G a b} £ G, a = 1, . . . , n A and 6=1,..., n B , a generic matrix of 
size n A x n B , with the same characteristics as C, such that it represents our 
"action." Here Q represents the set of all possible actions. Also, let L(-, •) be 
a loss function defined as L : Q x C — > R + where C is the set of all possible 
matching matrices. Our goal is to select, for a given loss function L, the 
optimal decision G* , the one which minimizes the posterior expected loss 

G* = argminVF"(G) 

where W(G) = E[L(C,G)\x A ,x B ]. In what follows we will consider some 
specific loss functions: 
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(1) Quadratic Loss 

L g (C,G) = ^2^2(C ab -G ab )2. 

a b 

Since the elements of C and G are either or 1, L q is equivalent to 
the L ± loss: L 1 (C,G) = ^ a ^ b \C ab -G ab \. 

(2) False Match Rate 



0, ifJ^G a6 = 0, 

L FMR (C,G) = ( EaEbGa6 /( Ca6 = 0) 



a b 

otherwise. 



Lpmr translates, in terms of decision theory, the classical use of the 
False Match Rate as a measure of performance of the record linkage 
analysis. 
(3) Absolute number of errors 

£ABS(C, G) = £[Ga6/(Ca6 = 0) + (1 - G ab )I{C ab = 1)]. 
a 6 

The following theorem provides the optimal solution for the above mentioned 
losses. 

Theorem 4.1. 

(A) Under losses L q and LabSj the optimal Bayesian solution is given by 
the matrix G* , defined as 

G *fl, ifp(C ab = l\x A ,x B ) > \, 
ab \ 0, otherwise, 

a= 1, . . . , n A ; b = 1, . . . , n B . 

(B) Under loss I/fmr? the optimal solution is a matrix consisting of all 
zeros. 

Proof. First, notice that I{C ab = 1) = C ab and I{C ab = 0) = 1 - C ab . 
(A): Since 

L q (C, G) = ^ ^[C a fe + G ab - 2C ab G ab ], 

a b 

the problem is equivalent to the maximization of the posterior expected 
value of 



L q (C,G) = 2^2YG ab 



a b 



Cab- j 
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With the loss Labs, simple calculations lead to 

L ABS (C, G) = £^[G o6 (l - C ab ) + (1 - G ab )C ab ] 



EE^ 



^GnhCnh + C. 



ab\ 



The minimization of the posterior expected loss of Labs is equivalent to the 
maximization of the quantity 



L q {C,G) = 2Y J Y, G ^ 



Cab- 2 



Then the quantities L q and Labs are identical and it will be sufficient to 
find the optimal solution for L q . We need to maximize 



T ab 



c n 



x A ,x B 



p(C ab = l\x A ,x B )-^ 



The last expression shows that the value that maximizes W q {G) is obtained 
by setting G ab = 1 if and only if the correspondent coefficient is positive, 
that is, when p(C ab = l\x A ,x B ) > |. 

(B): When Lfmr is used, it is easy to see that FMR is minimized by 
adopting the conservative behavior of not declaring any match! In this case, 
in fact, the posterior expected loss is always zero, independently of the pos- 
terior distribution. Then the optimal solution is given by G* ab = 0, for all 
(a, b). □ 

It is important to stress that all the optimal solutions derived in Theo- 
rem 4.1 are based on the marginal posterior probabilities of being a match 
for the various pairs (a, b). This is a consequence of the fact that the above 
loss functions are additive and they basically "sum" over all the losses due 
to the single mismatches. 

Part B of Theorem 4.1 is also important. It says that, from a decision the- 
oretic perspective, the FMR is not a valid measure of performance, because 
it only controls one type of error. Every reasonable loss function should also 
take into account a measure of the number of undiscovered matches [Gen- 
ovese and Wasserman (2003)]. In this sense, a reasonable loss function for 
record linkage may be given by the Global Error Rate 

EaE^ 1 - G ab )Ic ab =i{C ab ) 



Ltot(C, G) = Lfmr(C, G) + 



HaYsbi 1 ~ Gab) 
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The loss Ltot is actually able to capture errors due to missing true matches. 
However, the improvement is more theoretical than practical: in fact, the 
denominator of the second factor is so much larger than the denominator 
of Lfmr that the results obtained using Ltot should not be practically 
different from those derived under loss Lfmr- 

5. Illustrative application. We illustrate our approach in detail with the 
real data set already used in the Introduction. The two files consist of n A = 
34 records from a single block of the last Italian census population survey and 
n B = 45 records from the same block relative to the post enumeration survey; 
more details can be found in Alleva, Fortini and Tancredi (2007). Records 
in both files refer to foreign residents only, which typically represent an 
example of an elusive population. For each file, we take three key variables: 
X\ represents the first two consonants of the family name with 339 observed 
categories (considering all blocks), X2 represents the gender and X3 is the 
education level, with 17 categories. The total number of entries in V is 
k = 11,526. The data and the programs [written in C and R, R Development 
Core Team (2009)] that have been used for this application are available in 
the supplementary material [Tancredi and Liseo (2011)]. In practice, real 
applications may have more key variables, more blocks and larger sample 
sizes. However, focusing on a small example allows us to illustrate better 
some details of our methodology compared to the existing approaches. 

The hyperparameter g appearing in the prior distribution p(N) has been 
set equal to 2 in order to have a proper prior. The Dirichlet distributions 
for 9 are chosen so that, at the superpopulation level, X\ is independent of 
{X2,X%). We also assume that all the Dirichlet distributions are uniform in 
their supports. 

We have used the algorithm described in Section 4 to generate a single 
Markov chain of length 100,000. See the supplementary material for a graph- 
ical representation of some of the simulation traces. Figure 2 shows the pos- 
terior distributions of the following quantities: (a) the number of matches T, 
(b) the total population size N, (c) the measurement error parameter vector 
[li, (i = 1,2,3), (d) the probability of selecting a male within the block at 
the superpopulation level, 9.%.. In panel (d), we also show the posterior dis- 
tributions of 9.1. obtained by considering the two files separately, assuming 
a uniform prior and independence among the units. Notice that the poste- 
rior density of 9.\. can be graphically interpreted as an average of the two 
posteriors one would have obtained from the analysis of each single data set. 

The posterior estimated quantiles of level (0.05, 0.5, 0.975) for T are (26, 28, 
31). The same posterior summaries for N are (49, 55, 65). Marginal posterior 
probabilities of being a match, p(C a b = l\x ,x B ), are graphically displayed 
in panel (a) of Figure 3, where the cases have been sorted in order to have the 
most probable matches on the diagonal. There are only 34 pairs of records 
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(out of 1530) such that p(C a b = l\x A ,x B ) is larger than 0.1. The estimated 
matching matrix, using the quadratic loss function outlined in Section 4, 
is given by the 27 matches visible on the diagonal. Notice that inference 
about C is quite robust with respect to the choice of the hyperparameter g: 
when g = 1 we obtained exactly the same estimated matching matrix, while 
setting g = 3 would produce one more match. 

We now compare our results with other possible approaches based on the 
comparison vectors y a b whose frequency distribution is given in Table 2. As 
a first alternative we consider a slight modification of the Bayesian approach 
proposed by Larsen (2005) where y a b is marginally distributed as (2.1) and 
the matching matrix C satisfies the constraints Y^a^ab < 1 and Ylb^ab < 1- 
We use uniform priors for m and u. Unlike Larsen (2005), we have as- 
sumed, for the matching matrix C, the same prior distribution used in our 
approach. We will call this model the "Jaro constrained model." The pos- 
terior distribution for the parameters (m, u, C, N) can easily be simulated 
by using Gibbs steps for [m\u,C, N], [u\m,C,N] and [N\u,m,C]. To update 
the matching matrix C, we use the Metropolis-Hastings step proposed by 
Green and Mardia (2006). Figure 4 reports the posterior distributions of 
the parameters p = T/(n A ■ n B ), m and u. The posterior quantiles of level 
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Fig. 3. Matching estimation. Panel (a) shows the posterior probabilities p(C a b = l\x A ,x B ) under the new model. Panel (b) shows the 
posterior probabilities p(C a ,b = l|j/n, ■ ■ ■ ,2/„a „b) under the Jaro constrained model. Panels (c) and (d) show the posterior probabilities 
p(C a b = l\yab) and the estimated matching matrix using the classical approach. Values of the posterior probabilities are indicated by the 
shading scale at the right of each panel. 
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Table 2 



Results of the classic approach 



Frequency 



p((a,b) e M\y ab ) 



A 



(0,0,0) 
(1,0,0) 
(0,1,0) 
(1,1,0) 
(0,0,1) 
(1,0,1) 
(0,1,1) 

(1.1.1) 



659 
20 

601 
13 
78 
8 

126 
25 



0.00 
0.01 
0.00 
0.05 
0.23 
0.80 
0.56 
0.94 



0.01 
0.14 
0.04 
0.58 
3.43 
45.20 
14.81 
194.97 



The first two columns give the distribution of the comparison 
vector. The last two columns report the estimated quantities 
(2.3) and (2.2). 



(0.05,0.5,0.975) for T are estimated as (23,27,31). The same posterior sum- 
maries for N are (49,57,72). The marginal posterior probabilities of being 
a match, p(C a b = 1 |yn , . . . , y n A n a), are graphically displayed in panel (b) 
of Figure 3. Also in this case we have exactly 34 pairs of records such that 
p(C a b = • • • ,y n A n B) is larger than 0.1, but the matching matrix ob- 
tained with the quadratic loss provides 25 matches. In general, our proposed 
model and the Jaro constrained model provide similar estimates, although 
the latter seems to produce slightly more uncertainty as shown by the larger 
interval estimates for both T and N. 

Finally, we show the results obtained by considering model (2.1) without 
row or column constraints on the matching matrix C. Maximum likelihood 
estimates and posterior densities are reported in Figure 4. The matching step 
is performed by considering the posterior matching probabilities (2.3) or the 
likelihood ratios (2.2). In Table 2 we report these quantities obtained with 
a simple plug-in of the maximum likelihood estimates of the parameters. 
The posterior probabilities p((a,b) G M\y a b) are also displayed graphically 
in panel (c) of Figure 3. In this case there are 237 pairs with a posterior 
probability p( (a, b) G M\y a i,) greater than 0.1. The higher number of potential 
matches is almost certainly due to the fact that, in this approach, because 
of the independence assumption among comparison vectors and the absence 
of constraints on the C matrix, the marginal matching probabilities only 
depend on the information retrieved from the single comparison and not, 
as in the previous models, on the information provided by the entire data 
set. To rule out multiple matches, following Jaro (1989), we maximize the 
function 




A 



D 



(5.1) 
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Fig. 4. Posterior distributions for the parameters of model (2.1) with the constraints 
on the matching matrix C (solid line) and without (dotted line). For the latter case the 
constraint p < 1/2 has been used to guarantee identifiability and the vertical lines indicate 
maximum likelihood estimates. 



subject to the constraints J2a=i z ab < 1 V6, Ylb=i Zab — ^ ^ a an< ^ Zab e (0> 1} 
V(a, b). The final answer produces 29 matches displayed in panel (d) of Fig- 
ure 3. From Table 1 one can see that, by setting T = 29 in the hypergeo- 
metric likelihood ( n T ) (^b ™ T )/( n ^) and using the prior p(N) oc 1/iV 2 , one 
gets a 95% credible interval for N equal to [50,60], which is a subset of the 
intervals obtained using our approach or the Jaro constrained model. 

6. Multiple block application. In this section we illustrate the results 
obtained with a more realistic exercise involving a multiple block scenario. 
In particular, we repeated the analysis described in the previous section for 
each census enumeration area (census block) also selected for the post enu- 
meration survey and including at least one foreign person during the census 
survey. This way we obtained a list with 337 pairs of data sets for a total 
of 3675 records taken on foreign people during the 2001 census population 
survey and 3404 analogous records originating from the parallel post enu- 
meration survey. The block sizes vary from a minimum of one individual on 
at least one occasion to a maximum with 280 and 311 individuals on the 
two occasions. Note that the total number of blocks selected for the post 
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enumeration survey is 1098, corresponding to 0.31% of the total number of 
the Italian census enumeration areas. 

For each pair of data sets we performed a record linkage analysis in order 
to estimate the total number of foreign people iVj living in the Ith. census 
block, for 1 = 1,..., 337. In addition to the three key variables considered in 
the single block analysis outlined before, we also considered the age (coded 
into 10 categories). At the superpopulation level in our hierarchical model 
we assumed the surname to be independent of gender, education level and 
age. The probability vector for the surname categories is assumed, as before, 
to be uniform in its support. For the 340 = 2 x 17 x 10 joint probabilities of 
the other three key variables we set the Dirichlet hyperparameters all equal 
to 1/340 in order to avoid marginal distributions that are too concentrated. 

In the upper left panel of Figure 5 we show, for each block with approxi- 
mately at least 25 matches, the box-plot for the posterior distribution of Ni 
given by our approach. For each pair of data sets we also implemented the 
other two approaches described throughout the illustrative example, namely, 
the Jaro constrained model and the hybrid strategy obtained by estimating 
the matching matrix via the classical approach and then plugging in the 
estimated match number in the posterior distribution of population size. 




^00 200 301) 400 ° 1500000 2000000 2500000 3000000 



Fig. 5. Left panels: box-plots from the posterior distribution of the foreign population 
size for blocks with at least 25 matches. Right panels: posterior distribution of the foreign 
population size in Italy at the end of 2001. Upper panels: new model. Central panels: Jaro 
constrained model. Lower panels: Jaro unconstrained hybrid approach. 
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The box-plots for the posterior distribution of JVj obtained with these two 
approaches are shown, respectively, in the central and lower left panels of 
Figure 5. Note that the posterior distributions for iVj provided by the Jaro 
constrained model give point estimates similar to those obtained with our 
approach but with slightly wider credibility intervals. Instead, the pattern 
shown by the hybrid strategy is quite different. In particular, when com- 
pared to the other approaches, it shows a remarkable under-estimation of 
the block sizes. In fact, the maximization of the function (5.1) leads to 
an over-estimation of the true match number. However, introducing a false 
match rate correction as in Belin and Rubin (1995) would reduce the dis- 
tance from the other approaches. Nevertheless, there is a clear message that 
ignoring the matching uncertainty would give a false impression of accuracy 
for the estimates. 

The same conclusions are emphasized when we aim at estimating the 
quantity N = Ef=i JVj]/0.0031 which can be seen as a rough approxima- 
tion for the size of the foreign population in Italy at the end of 2001. The 
histograms shown in the right panels of Figure 5 have been obtained by 
summing the draws from the posterior distributions of Ni for each block 
with at least 2 records in both the surveys. In fact, smaller blocks tend to 
produce quite diffuse posterior distributions, making the MCMC inference 
difficult without introducing a more concentrated prior. To overcome this 
problem, the population size for the smaller blocks has been fixed equal to 
Ni = (nf + l)(nf + 1)/(TJ + 1) with TJ estimated by the classical approach. In 
particular, one can notice that accounting for matching uncertainty with the 
Jaro constrained model (central right panel) produces both a larger estimate 
and larger uncertainty with respect to our approach (upper right panel). 

7. Simulation studies. We now evaluate our hierarchical model via a sim- 
ulation study. Artificial data are often used to evaluate record linkage tech- 
niques, especially in computer science literature; see, for example, Christen 
(2005) and Christen and Pudjijono (2009). Here, we consider three main dif- 
ferent simulation scenarios generating, at the superpopulation level, three 
and six independent key variables (scenarios 1 and 3) and three dependent 
key variables (scenario 2). Common features across different simulations are 
as follows: 

• the population size, fixed at N = 100. 

• the sample size; we always assume n A = n B equal to 70,80,90. 

• the measurement error parameters /3's: their value has been fixed at 
(0.85,0.90,0.95). 

In the first two scenarios, the three key variables assume, respectively, 64, 
16 and 4 categories, leading to a contingency table with 4096 entries. In the 
independence case the means of the population frequencies Fj have been 
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set equal to Oj = Oj 1 ,j 2 ,j-i = X\i=i where bj i oc % with jj = 1, . . . , hi for 
i = 1,2,3. Under the dependence model we set Oj = Qj lt j 2 ,j3 = bj 3 bj 2 \j 3 bj 1 \j 3 

where = bj 3 oc j$, bj 2 u- 3 oc j^ 3 and 6j x u a oc j\ 3 ■ Finally, in the third scenario, 
the 6 key variables assume, respectively, 32, 16, 4, 4, 2 and 2 categories, 
leading to a contingency table with 32,768 cells with Oj = Oj x ,...,j % = IIi=i bji 
where bj i oc j j . 

For each combination of model parameters, we have generated 100 pairs 
of data sets. Each pair of data sets has been analyzed using our hierarchical 
model with 45,000 iterations of the MCMC algorithm and 5000 iterations 
discarded for burn-in. For each pair of data sets we also implemented the 
Jaro constrained model and the hybrid strategy described in the previous 
sections. Mixing and convergence rates were satisfactory based on the ex- 
amination of trace plots. 

In Table 3 we focus on the inference for N. For each of the three ap- 
proaches and for each group of 100 pairs of data sets we report the average 
values of the posterior mean of N, the estimated coverage of the 95% cred- 
ibility intervals and their mean length. Estimated standard errors are also 
given in parentheses. Note that in our approach the average value of the 
posterior mean for N is, in almost every experimental condition, the closest 
to the true value N = 100 and the one with the smallest standard error. 
However, the reduced bias of our approach was to be expected because the 
simulation generating process is exactly part of our model, while the other 
approaches present several misspecification elements. Note also that the Jaro 
constrained model and the hybrid approach have different behaviours, the 
former overestimating N and the latter underestimating it. This is the same 
trend already observed in the multiple block application. 

The performance of the alternative approaches does not improve when 
considering the interval estimates. In fact, with few exceptions, our approach 
produces the interval estimates with a coverage level closest to the nominal 
one. The hybrid approach, as expected, has dramatically low coverage level 
since it does not account for matching uncertainty. The Jaro constrained 
model always produces interval estimates wider than those provided by our 
model, partly because it only retrieves from the data the marginal informa- 
tion given by the comparisons. 

It is also interesting to note the behavior of the estimates with respect to 
the information carried by the data. When the sample sizes or the number 
of key variables increase, the uncertainty about N reduces with all three 
methods. In addition, both our model and the Jaro constrained model show 
a decrement in uncertainty as the measurement error level decreases, that 
is, when the fa's approach 1. 

In Table 4 we report the results regarding the estimation of the matching 
matrix C. In particular, for each method we show the average value of the 
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Table 3 

Simulation study for evaluating the posterior mean E(N) and the 95% credible interval 
under the new model (Mi), the Jaro constrained model (M2) and the Jaro unconstrained 

hybrid approach (M3) 
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Each cell reports a mean obtained with 100 pairs of data sets drawn from our model with 
iV = 100, n A — n B = n s and /?! = ••• = fa = • • • = fj h . Standard errors are in parentheses. 
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Table 4 

Simulation study for evaluating the false match rates under the new model ( columns 
Mi ), the Jaro constrained model ( columns M2 ) and the Jaro unconstrained hybrid 

approach ( columns M3 ) 

FMR1 FMR2 

/3i n s Mi M 2 M 3 Mi M 2 M 3 
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Each cell reports a mean obtained with 100 pairs of data sets drawn from our model with 
N = 100, n A — n B = n s and /3i = ■ ■ ■ = /3i = ■ ■ ■ = fin- Standard errors are in parentheses. 



where C is the point estimate obtained using the quadratic loss. The results 
of the comparisons among different methods would depend upon which type 
of FMR is used. In particular, under the FMR1 criterion, the better per- 
formance is established by the Jaro constrained model, followed by our ap- 
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proach and by the hybrid approach. However, one should recall that the Jaro 
constrained model tends to overestimate N and, consequently, it leads to 
a potential under-estimation of T = Y^ab^ab- This way, the FMR1 criterion 
would prefer the Jaro constrained approach, because of its "conservative 
behavior." Prom our perspective this is another argument in favor of the 
inadequacy of FMR1 as a single measure of performance of record linkage 
procedures. Finally, note that, when using the FMR2 criterion, the hybrid 
approach quite often shows the better performance with our model produc- 
ing a lower rate than the Jaro constrained model under the independence 
assumption. 

8. Discussion. Record linkage techniques pose several interesting prob- 
lems both from the methodological and the computational viewpoint. From 
a methodological perspective, the definition itself of the statistical frame- 
work within which comparisons among records should be performed is still 
under debate: in this paper we have proposed a novel Bayesian methodol- 
ogy- 
While it is definitely true that the result of a statistical analysis produced 
by an official organism must be objective (or — at least — it should be per- 
ceived as such by the users), it is also undeniable [see Fienberg (2011)] that 
Bayesian ideas and techniques can play an important role in official statis- 
tics, especially when important prior (or extra-experimental) information 
about the variables of interest exists and cannot be adequately exploited in 
a classical inference framework. In addition, even when prior information 
is lacking, a Bayesian analysis may be necessary simply because a classical 
approach cannot provide answers without introducing strong assumptions, 
not easily testable. In these situations a Bayesian analysis allows us, at least, 
to perform a sensitivity analysis, with the aim of quantifying the influence 
of the assumptions on inferences. 

From a computational perspective record linkage problems become formi- 
dable as soon as the sizes of the files are large. The intensive simulation 
methods required by any Bayesian approach for a matching problem make 
the computational problems in real applications even more crucial. One of 
the most popular solutions, valid also for our approach, is to perform the 
record linkage only between those records which show the same values on 
some blocking variables which are assumed to be recorded without errors. 
In addition, parallel computations for separated blocks may reduce the com- 
puting time in a significant way. 

The proposed model is built up on the actually observed categorical vari- 
ables drawn from a finite population and no reduction of the available in- 
formation, for example, by using Boolean comparison vectors, takes place. 
We also stress that prior information, provided by experts or by previous 
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surveys, can be introduced naturally into the record linkage process via the 
superpopulation model, for example, by giving specific association patterns 
between the key variables. Another important benefit is the acknowledg- 
ment and incorporation of the matching process uncertainty in estimating 
the population size as well as other population parameters. At the same 
time, the information available about the population parameters and their 
uncertainty are accounted for in the record linkage. 

Throughout the paper we have made some specific assumptions, such as 
the fixed sample sizes or the uniform distribution for the misspecified record 
fields and their conditional independence given the true values. Anyway, we 
are confident that our framework may provide a basis for several extensions 
with more general assumptions. In particular, some of the capture-recapture 
models used for a closed population [see, e.g., Wolter (1986) and Fienberg, 
Johnson and Junker (1999) or Erosheva, Fienberg and Joutard (2007) and 
Manrique-Vallier and Fienberg (2008) for more advanced proposals] could 
be incorporated as sampling models for the sample sizes and the number 
of recaptures. Multiple recaptures could be handled following Ruffieux and 
Green (2009), where a method for aligning multiple unlabeled configura- 
tions has been proposed. By assuming an exchangeable prior for f3i, . . . , /3h, 
we may also remove the assumption of conditional independence for the 
measurement error among record fields. In addition, different measurement 
error probabilities across files may be considered. Note also that the model 
has been developed so that each block is separately evaluated. However, fol- 
lowing Larsen (2005), we could allow a "borrowing of strength" effect across 
the blocks by introducing some extra layers in our prior modeling. Some of 
these extensions will be the object of future research. A similar approach 
for handling multivariate normal data is discussed in Liseo and Tancredi 
(2009). 

An important aspect of record linkage procedures which we have not 
addressed here is that of the nonrandomness of the samples, for example, 
in applications using administrative lists provided by register offices. This 
issue has some consequences in every modeling approach to record linkage; 
however, discussion about these problems is beyond the scope of this paper. 
In any case, we believe that the idea of a Bayesian superpopulation model 
generating the lists might be useful in this context too. 

Finally, note that the computer science literature on record linkage (also 
known as data matching or entity resolution) has developed, in recent times, 
some impressive algorithms based on machine learning and graph-based 
matching. Some relevant papers are Bhattacharya and Getoor (2007) and 
Kalashnikov and Mehrotra (2006) and it would be interesting to compare 
these or similar approaches with the statistical models presented in this 
paper. 
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APPENDIX A 



The sampling models (3.1) and (3.2) can be obtained as the marginal 
distribution of p(fi A , fi B , C, t\F) = p(^ A , fi B \C,t, F)p(C,t\F). First, we aver- 
age out C, so p(fi A , fi B \t, F) = ^2 c p(fi A , /j, B ,C\t, F). In this sum we only- 
need to consider those matrices C with exactly tj matches in the block 
{(a, b) : [i A = [i B = Vj} for j = 1, . . . , k. The total number of such matrices is 



nJ=i(f)(f)*i!-Also, 

p(v A ,V B \t,F) = Y J P(v A ^ B ,C\t,F) = ^P(v A ^ B \C,t,F)p(C\t,F) 



c 



c 






Then 



J>(^, M B , t\F) = J>(m A , At, F)p(t\F) 



fi 



n A \ I n B \ 



i i 



n?-(^)(^) (g) (V)(£ra 

\n B -T)\n A -Tl \T) \n B ) 



1 1 1 



/ n A \ ( n B \ ( N \ I N \ 
{f*,...,f£) l/f,...,/fJ \n A )\n B ) 




nj=i {fi) ( f s) 
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APPENDIX B 

The derivation of the full conditional distribution of t \F, /i , [i B , 9, /?, x A , x B : 

P (t\F, fJ , A , f x B ,e,/3,x A ,x B ) 
<xp(fj, A ,Li B \F,t)p(t\F) 

oc 



\t i ( F i~ tj ) TT* 1 (\ f] ) U k ( F A (n A \(N-n A \ 

llj=i K f A_ t .) llj=i \ff- tj ) llj=i U;j ( T )( n B_ T ) 



(N-T\ (N-n A \ (N\ (N\ 

\n A -T) \n B -T) \T) \n B ) 



n fe r til ^ 

1 li=l V ^T(F _ tj . )! (/« — -/f -/« +ij ) J 



j = l \fB) 

J 
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SUPPLEMENTARY MATERIAL 

Data files and codes (DOI: 10.1214/10-AOAS447SUPP; .zip). Included 
in the supplementary material there are the following files: exampleA.dat, 
exampleB.dat and exampleV.dat contain the data used in Section 5. The 
files B. Cat. matching. example. R, example. R, functions. r, gibbs.c contain the 
codes. The file supplementary_figure.pdf shows the trace plots for the appli- 
cation described in Section 5. 
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